CIND820 Capstone Project:

Implementing Machine Learning Price Prediction with the Ames Housing Dataset

Setting Up Environment

Importing necessary libraries and modules

There were some warning messages that repeated constantly later in the code, so, though I'm aware of the underlying issues, I suppress those warnings here.

Setting up display options for pandas
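A minimal sketch of this setup; the specific option values here are illustrative choices, not necessarily the ones used in the notebook.

```python
import warnings
import pandas as pd

# Silence the repeated warnings mentioned above (the underlying issues are known)
warnings.filterwarnings('ignore')

# Widen the pandas display so the many Ames columns are visible at once
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
```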

1. Basic Data Understanding & Data Cleaning

Loading and checking over the dataset

Checking for Duplicate Values

Setting the target variable

Basic Summary Statistics for Target

Creating indexes for discrete, continuous, ordinal, and nominal features while displaying the number of each type of feature.

Basic Summary Statistics for Discrete Features

A garage built in 2207? Either there is some time travelling going on or this is a mistake.

Handling discrete NaNs

In the case of categories like Basement Full Bath and Garage Cars, we would expect that a NaN value indicates the absence of that particular feature. As such, we fill the missing values with 0.

The one exception is Garage Year Built, where a zero value wouldn't make sense. Though it probably isn't a perfect solution, I decided to simply fill the NaN values with Year Built.
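The fill strategy described above can be sketched on a toy frame with column names matching the Ames dataset (the values here are made up for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Bsmt Full Bath': [1, np.nan, 2],
    'Garage Cars':    [2, 1, np.nan],
    'Garage Yr Blt':  [2005, np.nan, 1998],
    'Year Built':     [2004, 1976, 1997],
})

# A NaN here means the house lacks that feature, so fill with 0
for col in ['Bsmt Full Bath', 'Garage Cars']:
    df[col] = df[col].fillna(0)

# A zero garage year makes no sense; fall back to Year Built instead
df['Garage Yr Blt'] = df['Garage Yr Blt'].fillna(df['Year Built'])
```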

Basic Summary Statistics for Continuous Features

Handling Continuous NaNs

Similar to the discrete variables, NaN values for all of the continuous features here likely indicate the absence of that particular feature. For example, NaN values for Total Bsmt Square Footage or Masonry Veneer indicate that the house does not have a basement or veneer. Again, I fill these NaN values with 0.

There is something curious, though, about the large number of NaN values for Lot Frontage. This seemed strange to me, so I checked a random sample of records with NaN values for Lot Frontage against the assessor records that are available via the City of Ames website.

After checking the assessor records (overhead photos), all of the houses in my sample clearly have lot frontage on a street. Some have irregular shapes, such as crescent lots, but most of these still have some footage numbers for frontage. In other respects, the assessor records seem to match those in the dataset. This appears to be an error in data collection.

To resolve this in the present context, I will later use a KNN imputer to fill the missing Lot Frontage values.

Examining Value Counts for Ordinal Features

I create a quick function that displays all the value counts for a group of features.
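One possible shape for such a helper; the function name and column are illustrative, not the notebook's exact code.

```python
import pandas as pd

def value_counts_for(df, features):
    """Collect value counts (including NaN) for each feature in a group."""
    return {col: df[col].value_counts(dropna=False) for col in features}

# Toy example with a single ordinal feature
df = pd.DataFrame({'Kitchen Qual': ['TA', 'Gd', 'Gd', None]})
for col, counts in value_counts_for(df, ['Kitchen Qual']).items():
    print(f'--- {col} ---')
    print(counts)
```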

Handling Ordinal NaNs

For most of the ordinal features an NaN value indicates the absence of that feature so I fill the missing values with 'None'.

There is one exception. One single record has a NaN value for Electrical. It is very unlikely that this means that the house has no electrical system. It is necessary then to check the record in question.

Given that this particular house was built in 2006, it is almost certain that the house would have a standard breaker box. Just to check this though I will take a quick look at electrical system by year built.

As I thought, the most recent house built without a standard breaker was built in 1965. It is safe then to assume that our mystery house has a standard breaker.
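The check above can be sketched as a groupby over Electrical type (the data here is a toy sample, with column names assumed to match the dataset).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Electrical': ['SBrkr', 'FuseA', 'SBrkr', 'FuseF', np.nan],
    'Year Built': [2006, 1950, 1999, 1965, 2006],
})

# Most recent Year Built for each electrical system type
latest = df.groupby('Electrical')['Year Built'].max()
print(latest)

# In this toy data the newest non-SBrkr house dates to 1965, so the 2006
# house with the missing value can safely be filled with 'SBrkr'
df['Electrical'] = df['Electrical'].fillna('SBrkr')
```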

Examining Value Counts for Nominal Features

Handling Nominal NaNs

I fill NaN nominal values with 'None' as it is likely that these represent an absence of a feature such as a garage or alley access.

Again I came across a discrepancy. There are several houses that have 'None' for Mas Vnr Type yet there is a non-zero value for Mas Vnr Area.

Since there are only 5 records, I check the Ames City Assessor records again, examine photos of each of the houses, and then fill in the appropriate Mas Vnr Type or, alternatively, set the Mas Vnr Area to 0.

Note: this is now taken care of in the Discrete Features section, where all Garage Yr Blt NaN values are filled with Year Built. There was one other discrepancy that I came across: one record has a Garage Type value and a non-zero Garage Area value but 0 for Garage Yr Blt.

Dropping ID (Order) and Property Identifier (PID) from the dataframe

Re-coding any ordinal features that were not already numeric levels

Re-coding categorical features that had been coded with numbers

Simplifying Multilevel Categorical Features

I originally thought that simplifying a number of the ordinal features, so as to reduce levels with very few values, would be a good idea. After doing some modelling, I decided to scrap this experiment.

The remaining NaN values are those for Lot Frontage; they will be taken care of in the second data prep phase via the KNN imputer.

2. Exploratory Data Analysis

Scatter Plots Showing Relationship Between Features of Expected Importance and Sale Price

We can see here that there is a linear positive relationship between Greater Living Area and Sale Price.

There are, however, some extreme outliers - specifically 3 houses with square footage over 4500.

There is also a linear relationship between Lot Area and Sale Price.

Again there are a number of outliers, especially the 4 values greater than 100,000.

There is a linear relationship for the most part between Total Basement Square Footage and Sale Price. It does not appear to be quite as tight and there are a lot of 0 values.

This plot between Garage Area and Sale Price seems quite similar in shape to the scatter plot for Total Basement Square Footage and Sale Price but more spread out. It seems to be a fairly linear relationship with quite a few zero values.

In the last of the scatter plots, I wanted to see what plot would result if AboveGrade Living Area and Total Basement Square Footage were combined. The result is interesting as it shows a very strong linear relationship between Total Square Footage and Sale Price.

Boxplots Showing Relationship Between Categorical Variables of Interest and Sale Price

House Style

First we take a look at the relationship between House Style and Sale Price. We can see that it is possible to purchase nearly any style of home for a price in a band of about $140,000 to $200,000 in Ames. The price band for one-story and two-story homes varies considerably, starting under $50,000 and going up to the mid-$300,000s. Interestingly, while the 2.5 Finished and 2.5 Unfinished price bands are higher than that for 2Story homes, 2Story homes have a significantly higher ceiling. The remaining house styles have relatively tighter pricing bands.

Neighborhood

Next we look at Neighborhood. Immediately it is apparent that Stone Brook, Northridge Heights, and Northridge are the priciest neighborhoods, with median home prices over $300,000. Greenhills is only slightly cheaper. Otherwise, homes can be purchased in most neighborhoods within our price band of $150,000 to $200,000. Aside from the upper class areas, only the upper-middle class neighborhoods of Somerset, Timber, Veenker, and College Crescent have median prices above the $200,000 mark. Cheaper neighborhoods are Briardale, Meadow Village, Brookside, Old Town, Edwards, and South West of Iowa State University. It makes sense that Iowa DOT and Railroad is the cheapest place to buy a house in Ames.

This chart gives us a different perspective on Neighborhood by showing the average price by neighborhood and the frequency of homes along with a median line that allows us to compare the median home price in all of Ames.

Zoning

We can see that while homes in Residential Low-Density have a wider price band, they generally tend to have higher prices than Residential High-Density and Residential Medium-Density. The Floating Village zoning has the highest prices. (Note: Floating Village here refers to a type of planned community with various amenities built into the neighborhood)

Not surprisingly, homes that are in Commercial, Industrial or Agricultural zones tend to be cheaper.

Here we can see the majority of homes sold in Ames between 2006 and 2010 were built since the 1960s with a considerable number built around the turn of the millennium.

This set of boxplots gives us a bit more detail in terms of the price of a home depending on year built. As is perhaps obvious, newer homes tend to be more expensive. It is interesting though that houses built in 1892 seem to fetch higher prices than we might expect.

In this chart we can see that house prices actually declined in Ames over the 2006-2010 period, with a fall in prices after 2007 that corresponds to the beginning of the US Housing Finance Crisis.

This period is significantly different from the long-term housing price trend in Iowa, as we can see in this graph provided by the Federal Reserve.

Frequency Distributions and Histograms

Looking at the frequency distribution of Sale Price we can see that our target variable has a significant right-skew.

The probability plot confirms that SalePrice does not follow a normal distribution.

I attempted to perform a log transformation on both the target variable (Sale Price) and the significantly skewed independent variables. However, this resulted in both warnings and exceptions elsewhere in the code. This might be because log(x) approaches negative infinity as x approaches zero. (see http://onbiostatistics.blogspot.com/2012/05/logx1-data-transformation.html)

As an alternative, I utilized a log(x+1) transformation. (see https://www.kaggle.com/apapiu/house-prices-advanced-regression-techniques/regularized-linear-models)

This shifts the distribution to the right resulting in greater normality.
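The transformation can be sketched on a toy right-skewed sample; `np.log1p` computes log(x+1) accurately even near zero, which is what makes it safer than a plain log here.

```python
import numpy as np
import pandas as pd

# A small right-skewed sample standing in for Sale Price
prices = pd.Series([100000, 120000, 135000, 150000, 160000, 180000, 755000])
log_prices = np.log1p(prices)  # log(x + 1)

# The transformed distribution is far less right-skewed
print(prices.skew(), log_prices.skew())
```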

Looking at our histograms of discrete and continuous variables, we can see that most of the continuous features are right-skewed. It would be good then to transform the most skewed of these variables using log(x+1), as we will do with our target variable.

There are a lot of discrete features (along with Mas Vnr Area) that are dominated by 0 values. This makes sense, as in most cases homes do not have a pool or an enclosed porch. Nevertheless, it is doubtful that these zero-dominated features will be important when developing our predictive models.

I thought I would check a couple of individual histograms to see if any more detail emerges if we zoom in. We can see a bit more detail in these two features' distributions, but the large number of zero values really predominates.

Correlation Matrices

We begin by looking at an extremely large correlation matrix which includes all features in our data set.

While it is a bit chaotic, a few things stand out. First, there are some features which have a high level of collinearity. Most are fairly easy to understand:

Positive:

  1. Pool Area & Pool Quality
  2. Garage Quality & Garage Condition
  3. Garage Area & Garage Cars
  4. Year Built & Year Garage Built
  5. Total Rooms Above Ground & Total Area Above Ground
  6. Basement Finish Type Square Footage & Basement Type Square Footage

Negative:

  1. Basement Unfinished Square Footage & Basement Full Bathrooms
  2. Basement Unfinished Square Footage & Basement Finish Type Square Footage
  3. Year Built & Overall Condition

A bit more interesting is the close relationship between Total Basement Square Footage and First Floor Square Footage.

As well, there is a negative relationship between Enclosed Porch and Year Built.

Matrix of 21 Features Most Correlated with Target

Next we zoom in to look specifically at the 21 features with the highest level of correlation with our target variable - Sale Price.

We see that there are very strong correlations between Total Indoor Square Footage, Overall Quality, and AboveGround Living Area Square Footage.

Numeric Feature Correlation Matrix

Here we can again see problems of multicollinearity.
Particularly problematic are:

Positive:

  1. Garage Area and Garage Cars
  2. Total Basement Square Footage and 1st Floor Square Footage
  3. Total Rooms Above Ground and AboveGround Living Area

Negative:

  1. Basement Unfinished Square Footage and Basement Finished Square Footage
  2. Basement Unfinished Square Footage and Basement Full Bathrooms

We also see significant collinearity between Total Indoor Square Footage and AboveGround Living Area Square Footage, Total Basement Square Footage and 1st Floor Square Footage.

Categorical Feature Correlation Matrix Using Kendall Rank Method

Our correlation matrix using the Kendall Rank correlation for categorical features suggests that there is not a lot of monotonicity between categorical features.

The only variables that show a degree of monotonicity are Kitchen Quality with External Quality and Garage Condition with Garage Quality.

The last thing that I wanted to do with the EDA was a quick pair plot, just to get another look at a bunch of the key variables.

3. Data Preparation II

First, I remove extreme outliers as well as the 1st and 99th percentiles of SalePrice. Next, I consolidate the levels of several nominal variables and separate the dataset into object and nonObject features. I use a KNN imputer to fill the NaN values for Lot Frontage and make a quick check that there are no NaN values remaining. I then one-hot encode the object features using pd.get_dummies.

Next I take a brief look at some of the most skewed features among the nonObject features before performing a log(x+1) transformation on features with an absolute skew value greater than 0.6.
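The skew-based step can be sketched as follows; the toy columns here are illustrative, and the 0.6 cutoff is the one mentioned above.

```python
import numpy as np
import pandas as pd

non_obj = pd.DataFrame({
    'Lot Area':   [8000, 9500, 10200, 11000, 250000],  # heavily right-skewed
    'Year Built': [1965, 1978, 1990, 2002, 2006],      # roughly symmetric
})

# Find features whose absolute skew exceeds 0.6 and log(x+1)-transform them
skewness = non_obj.skew().abs()
skewed_cols = skewness[skewness > 0.6].index
non_obj[skewed_cols] = np.log1p(non_obj[skewed_cols])
```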

Finally, I perform a log(price+1) transformation on the target variable.

The index needs to be reset after removing records

I had originally filled Na values for Lot Frontage with the median but I decided later to use a KNN imputer for these values.
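A minimal sketch of the KNN imputation; in the notebook the imputer would see the full set of numeric features, not just the two toy columns here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

num = pd.DataFrame({
    'Lot Frontage': [60.0, np.nan, 80.0, 70.0],
    'Lot Area':     [7200, 7400, 11000, 9000],
})

# Fill each missing Lot Frontage with the mean of its 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)
num[:] = imputer.fit_transform(num)
```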

Normal probability plot shows that the log transformation has helped to make Sale Price better approximate a normal distribution.

Creating Test and Train Sets

We concatenate our object and nonObject features and then create train and test sets using 70% for train and 30% for test.
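The split described above can be sketched like this; the feature frame is a stand-in for the concatenated object/nonObject features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-ins for the encoded feature matrix and log-transformed target
X = pd.DataFrame({'Total SF': range(100), 'Overall Qual': [5] * 100})
y = pd.Series(range(100), name='SalePrice')

# 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```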

Standardization

Next we standardize our data to prevent large discrepancies in scale between features from causing distortions. Note this is done after the train/test split in order to prevent data leakage. Also, while fit_transform is used on the X_train data, only transform is used on the X_test data, again ensuring no data leakage.
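A sketch of that fit-on-train, transform-only-on-test pattern (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_std = scaler.transform(X_test)        # reuses the train statistics
```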

4. Feature Selection

Selecting Basic Features

Based on the correlation matrices above, I have selected 16 variables for a preliminary multiple linear regression model, taking the variables most highly correlated with Sale Price.

Basic Features:

  1. Overall Quality
  2. AboveGrade Living Area Square Footage
  3. External Quality
  4. Total Basement Area Square Footage
  5. Garage Area
  6. Basement Quality
  7. Year Built
  8. Garage Finish
  9. Full Bathrooms
  10. Year Remodel/Addition
  11. Fireplace Quality
  12. Masonry Veneer Area
  13. Heating Quality
  14. Basement Finish 1 Square Footage
  15. Lot Frontage
  16. Lot Area

In this section I try two different methods of feature selection using the sklearn.feature_selection module's SelectFromModel.

Random Forest Selector

By default the random forest regressor will identify features that produce significantly higher decreases in mean squared error than an average of all features.
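A sketch of this selector on synthetic data; by default SelectFromModel keeps features whose importance exceeds the mean importance across all features. The column names and data are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Overall Qual': rng.integers(1, 10, 200),
    'Total SF':     rng.normal(1500, 400, 200),
    'Noise':        rng.normal(0, 1, 200),       # unrelated to the target
})
y = 20000 * X['Overall Qual'] + 100 * X['Total SF'] + rng.normal(0, 5000, 200)

# Keep features with above-average impurity-based importance
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=0)).fit(X, y)
selected = list(X.columns[selector.get_support()])
```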

Hyperparameter tuning

Through trial and error I found that progressively lowering max features (at each split), max depth of trees, and max samples (the number of samples drawn from X to train each base estimator) works best. This resulted in progressively better RMSE scores, particularly on the training data (see final model below). However, when I pushed these values too low, the RMSE for test suddenly shot up to 50,000.

In terms of the number of trees, I found that 50,000 trees produced very slightly better results than 10,000 but took about double the time. As such, I went with 10,000 trees.

Through all my trials, the threshold for selection stayed pretty stable at 0.0038.

The number of selected features went up considerably with hyperparameter tuning. As the number of features increased, the relative importance of the most important features, Overall Quality and Total Indoor Square Footage, dropped.

I think that by applying a more rigorous method of hyperparameter fine tuning it may be possible to further optimize this selector.

Lasso Selector

For the LassoCV regressor, I set initial alpha values and then test them to obtain an optimal alpha score.

I then take my improved model and use it with SelectFromModel. The key hyperparameter here is the threshold value, which represents the cutoff for significant features. At a high level such as 0.25, the lasso model returns zero features. In order to tune the threshold parameter, I implemented a while loop that progressively lowered the value until 61 features were obtained.
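The threshold search can be sketched like this on synthetic data; here the target count is 10 rather than the notebook's 61, and the starting threshold is deliberately set too high so the loop has work to do.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=10,
                       noise=5.0, random_state=0)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

n_wanted = 10  # 61 in the notebook
threshold = 2 * float(np.abs(lasso.coef_).max())  # too high: selects nothing
selector = SelectFromModel(lasso, threshold=threshold, prefit=True)

# Lower the cutoff a step at a time until enough features survive
while selector.transform(X).shape[1] < n_wanted:
    threshold *= 0.9
    selector = SelectFromModel(lasso, threshold=threshold, prefit=True)
```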

Here is the list of the top coefficients.

The contrast between the variables selected by the Random Forest selector and the Lasso selector is very interesting.

The Random Forest selector tended to pick mainly numeric(nonObj) variables and the most important of these tend to be somewhat similar to what we might expect based on correlations with Sale Price.

While the Lasso Selector did choose variables such as Total SF and Overall Quality, it also chose a large number of categorical(obj) variables. Some of these were judged to be very important such as Neighborhood_Crawford, Exterior 1st_BrkFace, and Sale Condition Abnormal. I found it particularly interesting that Neighborhoods assumed such an increased importance as this group of features had been ignored by the Random Forest Selector.

I believe this points to the Random Forest selector favoring numeric variables with a high level of cardinality.

Main Feature Selection

Here I set up my main feature selection which includes 40 variables. The commented out lists of features represent alternative selections of features that I had tried earlier.

5. Preliminary Regression Models

In this section I create a series of basic multiple regression models with different selections of features.

Create K-Fold CV Scorers

Before running the preliminary models I set up my K-Fold Cross Validation Scoring Metrics (Root Mean Squared Error, Mean Absolute Error,Adjusted R2).
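One way those scorers can be set up; the data and model here are stand-ins, RMSE and MAE use scikit-learn's built-in scoring strings, and adjusted R2 is derived from R2 afterwards.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()

# Negated scorers come back negative, so flip the sign
rmse = -cross_val_score(model, X, y, cv=kf,
                        scoring='neg_root_mean_squared_error')
mae = -cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_error')
r2 = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Adjusted R2 penalizes R2 for the number of predictors p
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```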

i. Multiple Linear Regression Model Using All Features

Originally, I had intended to include a Multiple Linear Regression model that included all 81 features. This did not work at all and resulted in a broken model. While the RMSE for the training data was fairly normal, the RMSE for the test data ballooned. The R2 for the test data was negative, which indicates that the model is substantially worse than simply using the mean value of Sale Price as a predictor.

Extremely strange residuals plot and fitted-line chart.

ii. Multiple Linear Regression Model with Basic Features (Highest Correlations with SalePrice)

As an alternative for my base linear model, I took the top variables correlated with Sale Price minus several variables that had a high level of collinearity. This left me with 16 variables for the base model.

Overall, the base model performed fairly well.

We can see that the residuals are fairly tightly packed around the mean and there is no discernible pattern.

Below we can see that the model works fairly well.

iii. Multiple Linear Regression Model with Random Forest Feature Selection

The multiple regression model with the Random Forest-selected variables does perform better, pushing the RMSE on the test data to 0.1156826. Of course, it includes a far larger number of features (71). Again the residuals are tightly grouped around the mean error.

iv. Multiple Linear Regression Model with Lasso Regression Feature Selection

The last of the preliminary models creates a multiple regression model using the features picked by the Lasso selector. This set of features gives us the best results thus far.

Preliminary Model Results

Here then are all of the preliminary results

6. Main Models

We now move on to the main models. For all models, I set a K number for cross validation and a variable that lets me easily try different feature sets. Note: the commented-out code is a variety of other feature selections that I tried out.

i. Multiple Linear Regression

I run the model once using the %timeit magic, which reports the average time over 7 runs.

Next I implement the model followed by the CV scorers.

Set up predictions for entire train and test set for purpose of charting.

We can see that the plots are very similar to the preliminary regression models.

The results are very impressive. The only issue with the multiple linear regression model is that with some of the larger feature sets the model breaks (as it did with all features included).

ii. Lasso Regression

Here, I implement the Lasso Regressor. While for the other models I utilized RandomizedSearchCV in order to test different parameters, with the Lasso model I only have to tune the alpha value, which can simply be done with a test run of the Lasso model, which is itself a CV model.

I check the time it takes to run the Lasso model.

Run the Lasso model followed by CV scorers

The residuals plot looks fine.

I thought I would check to see what features the model was using.

The results are very good, though only a little better than plain Multiple Linear Regression.

iii. Decision Tree Regressor

Starting with the Decision Tree I use RandomizedSearchCV to narrow down the optimal parameters after a bit of trial and error.

I then visualize the loss function to see how adding levels to the tree depth works in relation to MSE.

After the preliminary test is over, I time the final model and then implement it again for the various cross validation scorers.

Here I visualize the entire 8 levels of the regression tree. The resulting chart is really big and would be difficult to show in a report or presentation so I created a representation of the first 3 levels to give an idea of how the tree model proceeded.

The residuals seem a bit more spread out but I think it is still okay.

This line chart struck me as a bit funny. The points are all really concentrated in one area, but that concentration is quite fat.

The results for the Regression Tree are disappointing in comparison with the other models.

iv. Random Forest Regressor

I follow much the same procedure here beginning with trying out a number of different parameters.

Again I visualize how the model progresses. Note that here I am using an Out of Bag Error rate.

Check the time for the random forest model

Next I implement the model and run the CV scorers.

Here I visualize the last of the forest's tree predictors. This is an even bigger tree, with a depth of 50. It is quite different from the Regression Tree model, which makes sense, as the point of the random forest is to create a large number of different trees to capture (reduce) variation in the dataset.

I just wanted to check to see the max depth for the tree above.

Again the residuals seem a little funny but I don't believe there is any clear pattern.

The data points are tighter for the random forest and not quite as fat.

Fairly decent results although random forest doesn't perform quite as well as plain linear regression and requires far more time.

v. Gradient Boosting (XGBoost)

I decided to use XGBoost instead of scikit-learn's gradient boosting algorithm because it apparently runs faster, is more memory efficient, and offers better predictive performance. Conveniently, XGBoost does have a sklearn wrapper.

With XGBoost, I had some trouble when I tried to chart a more developed model. As such, I reversed my normal procedure and start by charting the model's loss function. We can see the RMSE score declining as the model runs through its training iterations.

I decided that I would go with a larger n_estimators value than the loss function suggested, as in trials the results were a bit better.

After timing the model, I implement it again along with the CV scorers.

The residuals plot looks better than the one for random forest.

Appearance wise this looks pretty similar to the random forest fitted line chart.

Quite good results that are right up there with Lasso and Multiple Regression in terms of effectiveness. Of course, the model does take a bit more time but is still quite efficient for an ensemble model.

vi. Artificial Neural Network Perceptron (Keras)

I utilize the GPU in order to speed up calculations.

Here I define a basic model and then experiment with RandomizedSearchCV to find optimal parameters. Given the time required to run the models this took a while.

Note: For most models I have left the RandomizedSearchCV code but for the neural network I commented it out as it takes so long.

I was having problems with my video card still tying up all its memory after I ran a model. This helped release the memory.

After trying to figure out what setting might work best, I implement my model. While I did try a model with 2 hidden layers, the results with a single layer of 85 nodes worked the best. Note I am using the sklearn regressor wrapper for Keras.

Here I chart the loss function for the ANN. In contrast to the other models, this was not so helpful. The chart suggests that the RMSE should decline below 0.025 for both train and test data but I could not get anywhere close to those results with my CV scorers.

In practice I found that a model with 100 epochs and a batch size of 25 produced the best results.

Here I created a very basic visualization of the ANN model. It really doesn't tell us very much. I need to do some further work to find better ways of representing neural network models. I believe that TensorBoard might be the way to go, but I need to do more work to figure it out.

This residuals plot is concerning, as the center of the distribution is a fair bit higher than the mean.

The results are okay. It provides better effectiveness metrics than the regression tree. However, it is absolutely the worst model in terms of time efficiency.

7. The Final Results

Here then are the final results for all models at the particular k level set earlier.

Lastly, I export the results to an Excel file, which is handy for creating tables for PowerPoint and Word.